[code_review] Misc improvements part 4 #5588

suhaibmujahid · 2026-01-05T02:51:41Z

These improvements could be reviewed commit by commit.

We can have better tracking with W&B Weave

…mponents

Introduces a run_by_diff_id method that retrieves a patch by diff ID from review_data and runs the review process.

Eliminated the abstract version property from GenerativeModelTool and removed the version attribute from CodeReviewTool since it is not used.

Moved suggestion filtering logic from code_review/agent.py to a new suggestion_filtering module. Introduced SuggestionFilteringTool for filtering review comments, updated CodeReviewTool to use the new filterer, and relocated related prompt templates. This improves modularity and separation of concerns for suggestion filtering.

Wrapped comments and rejected examples in <comments-to-filter> and <rejected-examples> tags to improve prompt structure and clarity.

It will be replaced with W&B Weave evaluation pipeline

Copilot

Pull request overview

This PR refactors the code review evaluation infrastructure by replacing old script-based evaluation tools with a more modular architecture and W&B Weave integration for tracking evaluations.

Changes:

Removes legacy evaluation scripts (code_review_tool_evaluator.py, code_review_tool_evaluator_report.py) and experimental files
Introduces new modular tools for patch summarization, suggestion filtering, and comment matching
Adds Jupyter notebooks for dataset creation and evaluation using W&B Weave
Refactors CodeReviewTool to use Protocol-based dependency injection for better testability
Updates platform base classes to accept both str and int for patch_id parameters

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 14 comments.

Show a summary per file

File	Description
scripts/code_review_tool_evaluator_report.py	Removed legacy evaluation report generator
scripts/code_review_tool_evaluator.py	Removed legacy evaluation script (613 lines)
experiments/review_helper_modify_filtering_step.ipy	Removed experimental filtering modification script
requirements.txt	Added weave>=0.50.0 for evaluation tracking
notebooks/code_review_evaluation.ipynb	New notebook for running W&B Weave evaluations
notebooks/code_review_create_dataset.ipynb	New notebook for creating evaluation datasets
bugbug/tools/suggestion_filtering/prompts.py	Extracted filtering prompts to dedicated module
bugbug/tools/suggestion_filtering/agent.py	New modular suggestion filtering tool
bugbug/tools/patch_summarization/prompts.py	Extracted summarization prompts to dedicated module
bugbug/tools/patch_summarization/agent.py	New modular patch summarization tool
bugbug/tools/comment_matching/prompts.py	New prompts for LLM-based comment matching
bugbug/tools/comment_matching/agent.py	New tool for matching generated vs ground truth comments
bugbug/tools/code_review/scorer.py	New Weave scorers for evaluation metrics
bugbug/tools/code_review/utils.py	Refactored to work with structured comment objects
bugbug/tools/code_review/prompts.py	Removed prompts moved to specialized modules
bugbug/tools/code_review/agent.py	Refactored to use Protocol-based dependencies
bugbug/tools/base.py	Simplified by removing version property and print method
bugbug/tools/core/platforms/base.py	Updated signature to accept str or int for patch_id
bugbug/tools/core/platforms/phabricator.py	Updated signature to accept str or int for patch_id
bugbug/tools/core/platforms/swarm.py	Updated signature to accept str or int for patch_id
bugbug/code_search/searchfox_api.py	Made get_file parameter optional with default implementation
bugbug/code_search/mozilla.py	Made get_file parameter optional with fallback

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-10T22:41:02Z

bugbug/tools/comment_matching/agent.py

+            system_prompt=SYSTEM_PROMPT,
+            response_format=ProviderStrategy(AgentResponse),
+        )
+


The create method is missing the @staticmethod decorator. This method should be decorated with @staticmethod to match the pattern used in other similar factory methods in the codebase (e.g., PatchSummarizationTool.create(), SuggestionFilteringTool.create()).

Suggested change

@staticmethod

Copilot · 2026-01-10T22:41:03Z

bugbug/tools/suggestion_filtering/agent.py

+from pydantic import BaseModel, Field
+
+from bugbug.tools.base import GenerativeModelTool
+from bugbug.tools.code_review.agent import GeneratedReviewComment


Potential circular import issue: this module imports GeneratedReviewComment from bugbug.tools.code_review.agent, while bugbug.tools.code_review.agent imports SuggestionFilteringTool from this module. This circular dependency could cause import errors. Consider moving GeneratedReviewComment to a shared data types module or using TYPE_CHECKING with forward references.

Copilot · 2026-01-10T22:41:03Z

bugbug/tools/code_review/agent.py


        return result["structured_response"].comments

+    def run_by_diff_id(self, diff_id: str) -> list[InlineComment] | None:


The parameter type for diff_id should be str | int to match the signature of get_patch_by_id which was updated in the base class to accept both str and int. This inconsistency could cause type checking issues.

Suggested change

def run_by_diff_id(self, diff_id: str) -> list[InlineComment] | None:

def run_by_diff_id(self, diff_id: str | int) -> list[InlineComment] | None:

Copilot · 2026-01-10T22:41:03Z

bugbug/tools/patch_summarization/agent.py

+from bugbug.tools.patch_summarization.prompts import PROMPT_TEMPLATE_SUMMARIZATION
+
+
+class PatchSummarizationTool(GenerativeModelTool):


The PatchSummarizationTool class is missing proper license header comments. While the file has the MPL license header, the class itself should maintain consistency with other similar tool classes that have docstrings explaining their purpose.

Suggested change

class PatchSummarizationTool(GenerativeModelTool):

class PatchSummarizationTool(GenerativeModelTool):

"""Tool for generating natural-language summaries of code patches.

This tool uses a chat-based large language model to summarize a given

patch set in the context of the target software, producing a concise

explanation suitable for reviewers or other stakeholders.

"""

Copilot · 2026-01-10T22:41:04Z

bugbug/tools/comment_matching/agent.py

+
+    def run(
+        self, old_comments: list[MatchingComment], new_comments: list[MatchingComment]
+    ) -> list[CommentMatch]:


The run method lacks a docstring explaining its parameters and return value. This is especially important for public API methods. Similar methods in other tool classes (e.g., SuggestionFilteringTool.run) have comprehensive docstrings.

Suggested change

) -> list[CommentMatch]:

) -> list[CommentMatch]:

"""

Find matching code review comments between an old and a new set.

This method sends the provided comments to the underlying language model

agent, which identifies pairs of comments that correspond to each other

across the two versions.

Parameters

----------

old_comments:

A list of :class:`MatchingComment` instances representing comments

from the old set (e.g., an earlier revision of a code review).

new_comments:

A list of :class:`MatchingComment` instances representing comments

from the new set (e.g., a later revision of the same code review).

Returns

-------

list[CommentMatch]

A list of :class:`CommentMatch` objects, each describing a matched

pair of comments between the old and new sets. If either

``old_comments`` or ``new_comments`` is empty, an empty list is

returned.

"""

Copilot · 2026-01-10T22:41:05Z

bugbug/tools/patch_summarization/prompts.py

@@ -0,0 +1,25 @@
+PROMPT_TEMPLATE_SUMMARIZATION = """You are an expert reviewer for {target_software}, with experience on source code reviews.
+
+Please, analyze the code provided and report a summarization about the new changes; for that, focus on the coded added represented by lines that start with "+".


There's a typo in the prompt text. "the coded added" should be "the code added".

Suggested change

Please, analyze the code provided and report a summarization about the new changes; for that, focus on the coded added represented by lines that start with "+".

Please, analyze the code provided and report a summarization about the new changes; for that, focus on the code added represented by lines that start with "+".

Copilot · 2026-01-10T22:41:05Z

bugbug/tools/comment_matching/agent.py

@@ -0,0 +1,62 @@
+from langchain.agents import create_agent


The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.

Copilot · 2026-01-10T22:41:06Z

bugbug/tools/patch_summarization/agent.py

@@ -0,0 +1,48 @@
+from langchain.agents import create_agent


The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.

Copilot · 2026-01-10T22:41:06Z

bugbug/tools/suggestion_filtering/agent.py

@@ -0,0 +1,156 @@
+# -*- coding: utf-8 -*-


The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.

Copilot · 2026-01-10T22:41:06Z

bugbug/tools/comment_matching/agent.py

+    def create(**kwargs):
+        if "llm" not in kwargs:
+            from bugbug.tools.core.llms import create_openai_llm
+
+            kwargs["llm"] = create_openai_llm()
+
+        return CommentMatchingTool(**kwargs)


Normal methods should have at least one parameter (the first of which should be 'self').

Suggested change

def create(**kwargs):

if "llm" not in kwargs:

from bugbug.tools.core.llms import create_openai_llm

kwargs["llm"] = create_openai_llm()

return CommentMatchingTool(**kwargs)

@classmethod

def create(cls, **kwargs):

if "llm" not in kwargs:

from bugbug.tools.core.llms import create_openai_llm

kwargs["llm"] = create_openai_llm()

return cls(**kwargs)

suhaibmujahid added 14 commits December 24, 2025 14:10

Remove unused _print_answer method and related calls

2fb736c

We can have better tracking with W&B Weave

Refactor FunctionSearchMozilla file retrieval logic

9fe827b

Improve the factory method to create a CodeReviewTool with default co…

bd85288

…mponents

Add run_by_diff_id method to CodeReviewTool

654c436

Introduces a run_by_diff_id method that retrieves a patch by diff ID from review_data and runs the review process.

Support providing the patch summery when generating suggestions

c2ba015

Update get_patch_by_id to accept str or int patch_id

afbe930

Remove version property from GenerativeModelTool

38954ad

Eliminated the abstract version property from GenerativeModelTool and removed the version attribute from CodeReviewTool since it is not used.

Refactor patch summarization into separate tool

002475a

Add XML-like tags to filtering prompt template

051b20d

Wrapped comments and rejected examples in <comments-to-filter> and <rejected-examples> tags to improve prompt structure and clarity.

Refactor filtering to return indices instead of full comments

6385426

Add comment matching tool

6b00794

Remove code review evaluation scripts

7f1400f

It will be replaced with W&B Weave evaluation pipeline

Add code review evaluation pipeline

7e1ca6b

suhaibmujahid marked this pull request as ready for review January 10, 2026 22:34

suhaibmujahid requested review from Copilot and marco-c January 10, 2026 22:34

Copilot started reviewing on behalf of suhaibmujahid January 10, 2026 22:35 View session

Copilot AI reviewed Jan 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[code_review] Misc improvements part 4 #5588

[code_review] Misc improvements part 4 #5588

Uh oh!

suhaibmujahid commented Jan 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Copilot AI Jan 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant


		return result["structured_response"].comments

		def run_by_diff_id(self, diff_id: str) -> list[InlineComment] \| None:

	def run_by_diff_id(self, diff_id: str) -> list[InlineComment] \| None:
	def run_by_diff_id(self, diff_id: str \| int) -> list[InlineComment] \| None:

		from bugbug.tools.patch_summarization.prompts import PROMPT_TEMPLATE_SUMMARIZATION


		class PatchSummarizationTool(GenerativeModelTool):

-class PatchSummarizationTool(GenerativeModelTool):
+class PatchSummarizationTool(GenerativeModelTool):
+    """Tool for generating natural-language summaries of code patches.
+    This tool uses a chat-based large language model to summarize a given
+    patch set in the context of the target software, producing a concise
+    explanation suitable for reviewers or other stakeholders.
+    """

-    ) -> list[CommentMatch]:
+    ) -> list[CommentMatch]:
+        """
+        Find matching code review comments between an old and a new set.
+        This method sends the provided comments to the underlying language model
+        agent, which identifies pairs of comments that correspond to each other
+        across the two versions.
+        Parameters
+        ----------
+        old_comments:
+            A list of :class:`MatchingComment` instances representing comments
+            from the old set (e.g., an earlier revision of a code review).
+        new_comments:
+            A list of :class:`MatchingComment` instances representing comments
+            from the new set (e.g., a later revision of the same code review).
+        Returns
+        -------
+        list[CommentMatch]
+            A list of :class:`CommentMatch` objects, each describing a matched
+            pair of comments between the old and new sets. If either
+            ``old_comments`` or ``new_comments`` is empty, an empty list is
+            returned.
+        """

		@@ -0,0 +1,25 @@
		PROMPT_TEMPLATE_SUMMARIZATION = """You are an expert reviewer for {target_software}, with experience on source code reviews.

		Please, analyze the code provided and report a summarization about the new changes; for that, focus on the coded added represented by lines that start with "+".

[code_review] Misc improvements part 4 #5588

Are you sure you want to change the base?

[code_review] Misc improvements part 4 #5588

Uh oh!

Conversation

suhaibmujahid commented Jan 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant